# Lab 10 - Logistic Regression

The Challenger Space Shuttle tragically exploded in 1986, killing all astronauts on board. The explosion was shown to have been caused by an O-ring failure, likely due to cold temperatures on the day of the launch (and also poor engineering that allowed this failure to cause such a catastrophe).

This lab will use experimental data from tests on whether O-rings failed at different temperatures. The data set can be downloaded [here](http://comet.lehman.cuny.edu/owen/teaching/mat328/chall.txt).

Some of this lab is based on the Harvard Data Science CS109 Lab 4, Fall 2015.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.formula.api as smf


%matplotlib inline

Read the data into a dataframe.

Create a scatter plot with temperature on the x axis and failure on the y axis. What do you notice?

We will now use statsmodels to fit a logistic regression model to the data. Notice that the code is similar to the code we used to fit a linear regression model.

In [None]:
# `data` is the dataframe you read in above; its column names must match the formula
logit_model = smf.logit('Failure ~ Temperature', data=data).fit()
logit_model.summary()

Is there an R-squared value in the summary? What is the formula for the model?

There is another way to get the model parameters:

In [None]:
logit_model.params

We can use these parameters to graph the model equation on the data. 

First, create 200 evenly spaced x values (look at the data to see what their range should be): 
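One possible way to build such a grid is `np.linspace`, sketched here with placeholder temperatures rather than the lab data. The array `example_temps` is hypothetical; in the lab, take the minimum and maximum from the temperature column of your own dataframe.

```python
import numpy as np

# Placeholder temperatures for illustration only; these are NOT the lab data.
# In the lab, use the temperature column of your own dataframe instead.
example_temps = np.array([50.0, 60.0, 70.0, 80.0])

# 200 evenly spaced points from the smallest to the largest temperature,
# with both endpoints included
x = np.linspace(example_temps.min(), example_temps.max(), 200)

print(len(x), x[0], x[-1])
```

Spanning exactly the observed range keeps the plotted curve over the data; a slightly wider range would also work if you want to see the curve flatten out at either end.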

Next, we can compute $\beta_0 + \beta_1 x$ for all of these x values:

In [None]:
p = logit_model.params
reg = p['Intercept'] + x*p['Temperature']

Finally we can plug `reg` into the logistic equation to get the y values:

In [None]:
y = np.exp(reg)/(1 + np.exp(reg))
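The expression above is the logistic (sigmoid) function. Dividing its numerator and denominator by $e^{reg}$ gives the equivalent form $1/(1 + e^{-reg})$. The short check below illustrates this with arbitrary input values; the names `logistic` and `reg_demo` are made up for this sketch and are not part of the lab.

```python
import numpy as np

def logistic(reg):
    """Map a linear predictor beta_0 + beta_1 * x to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-reg))

# Arbitrary illustrative values of the linear predictor (not from the lab data)
reg_demo = np.linspace(-5, 5, 11)

y_lab = np.exp(reg_demo) / (1 + np.exp(reg_demo))  # form used in the lab
y_alt = logistic(reg_demo)                         # algebraically equivalent form

print(np.allclose(y_lab, y_alt))
```

Because the output always lies strictly between 0 and 1, the y values can be read as the model's estimated probability of failure at each x.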

Plot another scatter plot of the data, plus the plot of our calculated x and y values:

One way to understand how well our model works is to make a *confusion table* or *confusion matrix*, which counts how many predictions of each type (correct and incorrect) the model makes. We can create the table using the model's `pred_table()` method.

In [None]:
logit_model.pred_table()

The confusion matrix can be read as follows:

| | predicted 0 | predicted 1 |
|---|---|---|
| observed 0 | true negative | false positive |
| observed 1 | false negative | true positive |
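As a rough sketch of how to turn such a table into counts and an overall accuracy, the helper below assumes the layout described above (rows are observed values, columns are predicted values). The function name `summarize_confusion` and the counts in `example_table` are made up for illustration; they are not output from the lab's model.

```python
import numpy as np

def summarize_confusion(table):
    """Unpack a 2x2 confusion table (rows = observed, columns = predicted)."""
    table = np.asarray(table, dtype=float)
    tn, fp = table[0]   # observed 0: predicted 0, predicted 1
    fn, tp = table[1]   # observed 1: predicted 0, predicted 1
    return {
        "true_negative": tn,
        "false_positive": fp,
        "false_negative": fn,
        "true_positive": tp,
        # fraction of all predictions that were correct
        "accuracy": (tn + tp) / table.sum(),
    }

# Made-up counts for illustration only; NOT the lab's results
example_table = [[5, 1], [2, 4]]
summary = summarize_confusion(example_table)
print(summary["accuracy"])
```

In the lab you could pass `logit_model.pred_table()` to the same helper. Note that `pred_table()` classifies an observation as 1 when its predicted probability exceeds a threshold, which defaults to 0.5.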



How many correct predictions did the model make? What kind of wrong predictions did the model make?

# Pima (Akimel Oʼodham) Indian Diabetes data

The Akimel O'odham people, also known as the Pima Indians since European colonization of the US, currently have a high prevalence of diabetes. A data set of possible diabetes indicators, along with whether each person has diabetes, is on [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database) or available [here](http://comet.lehman.cuny.edu/owen/teaching/mat328/diabetes.csv).

Read in the dataset.

Plot a scatter plot of glucose vs. diabetes.

Fit a logistic regression model to this data.

Plot the model equation on top of your scatter plot.

What does the confusion table tell you about the fit of this model?